IST Austria Thesis
Because of the increasing popularity of machine learning methods, it is becoming important to understand the impact of learned components on automated decision-making systems and to guarantee that their consequences are beneficial to society. In other words, it is necessary to ensure that machine learning is sufficiently trustworthy to be used in real-world applications. This thesis studies two properties of machine learning models that are highly desirable for the
sake of reliability: robustness and fairness. In the first part of the thesis we study the robustness of learning algorithms to training data corruption. Previous work has shown that machine learning models are vulnerable to a range
of training set issues, varying from label noise through systematic biases to worst-case data manipulations. This problem is especially relevant today, since modern machine learning methods are particularly data-hungry, and practitioners therefore often have to rely on data collected from various external sources, e.g. from the Internet, from app users or via crowdsourcing. Naturally, such sources vary greatly in the quality and reliability of the
data they provide. With these considerations in mind, we study the problem of designing machine learning algorithms that are robust to corruptions in data coming from multiple sources. We show that, in contrast to the case of a single dataset with outliers, successful learning within this model is possible both theoretically and practically, even under worst-case data corruptions. The second part of this thesis deals with fairness-aware machine learning. There are multiple areas where machine learning models have shown promising results, but where careful consideration is required to avoid discriminatory decisions by such learned components. Ensuring fairness can be particularly challenging, because real-world training datasets are expected to contain various forms of historical bias that may affect the learning process. In this thesis we show that data corruption can indeed render the problem of achieving fairness impossible, by tightly characterizing the theoretical limits of fair learning under worst-case data manipulations. However, assuming access to clean data, we also show how fairness-aware learning can be made practical in contexts beyond binary classification, in particular in the challenging learning-to-rank setting.
Robust Learning from Untrusted Sources
Modern machine learning methods often require more data for training than a
single expert can provide. Therefore, it has become a standard procedure to
collect data from external sources, e.g. via crowdsourcing. Unfortunately, the
quality of these sources is not always guaranteed. As additional complications,
the data might be stored in a distributed way, or might even have to remain
private. In this work, we address the question of how to learn robustly in such
scenarios. Studying the problem through the lens of statistical learning
theory, we derive a procedure that allows for learning from all available
sources, yet automatically suppresses irrelevant or corrupted data. We show by
extensive experiments that our method provides significant improvements over
alternative approaches from robust statistics and distributed optimization.Comment: Accepted to International Conference on Machine Learning (ICML),
2019; Camera-ready versio
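The core idea above — learn from all sources but automatically suppress corrupted ones — can be illustrated with a toy sketch. This is not the paper's actual procedure; the median-distance filtering rule, the threshold, and all function names below are assumptions chosen for illustration:

```python
from statistics import median

def robust_source_weights(source_means, threshold):
    """Assign weight 1 to sources whose summary statistic lies close to
    the median across sources, 0 otherwise (a crude robustness proxy:
    a corrupted minority of sources cannot move the median far)."""
    m = median(source_means)
    return [1.0 if abs(s - m) <= threshold else 0.0 for s in source_means]

def weighted_estimate(source_means, weights):
    """Aggregate only the sources that survived the filtering step."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, source_means)) / total

# Five sources report a statistic; the fourth source is corrupted.
means = [1.0, 1.1, 0.9, 10.0, 1.05]
w = robust_source_weights(means, threshold=0.5)   # suppresses the outlier
est = weighted_estimate(means, w)                  # close to the clean value
```

The point of the sketch is the two-stage structure: a cross-source comparison decides which sources to trust, and only then does aggregation (or training) happen on the retained data.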
Fairness-aware PAC learning from corrupted data
Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading
accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups' frequencies in the large-data limit.
The Convergence of Sparsified Gradient Methods
Distributed training of massive machine learning models, in particular deep
neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.
Several families of communication-reduction methods, such as quantization,
large-batch methods, and gradient sparsification, have been proposed. To date,
gradient sparsification methods - where each node sorts gradients by magnitude,
and only communicates a subset of the components, accumulating the rest locally
- are known to yield some of the largest practical gains. Such methods can
reduce the amount of communication per step by up to three orders of magnitude,
while preserving model accuracy. Yet, this family of methods currently has no
theoretical justification.
This is the question we address in this paper. We prove that, under analytic
assumptions, sparsifying gradients by magnitude with local error correction
provides convergence guarantees, for both convex and non-convex smooth
objectives, for data-parallel SGD. The main insight is that sparsification
methods implicitly maintain bounds on the maximum impact of stale updates,
thanks to selection by magnitude. Our analysis and empirical validation also
reveal that these methods do require analytical conditions to converge well,
justifying existing heuristics.
Comment: NIPS 2018 - Advances in Neural Information Processing Systems; Authors in alphabetic order
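The mechanism analyzed here — magnitude-based top-k sparsification with local error correction — can be sketched in a few lines. This is a minimal illustration of the general technique, not the paper's implementation; the function names and the toy gradient are assumptions:

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude components of a gradient vector;
    zero out the rest (this is what gets communicated)."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    keep = set(idx)
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]

def step_with_error_feedback(grad, residual, k):
    """One communication round on one node: add the locally accumulated
    residual, transmit the top-k of the corrected gradient, and keep
    the untransmitted remainder as the new residual."""
    corrected = [g + r for g, r in zip(grad, residual)]
    sent = topk_sparsify(corrected, k)
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual

grad = [0.5, -2.0, 0.1, 1.5]
sent, res = step_with_error_feedback(grad, [0.0] * 4, k=2)
# Only the two largest-magnitude entries are sent; the small ones
# accumulate locally and will eventually grow large enough to be sent.
```

The error-feedback residual is exactly why the analysis goes through: selection by magnitude bounds how stale any accumulated update can become before it is finally transmitted.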
FLEA: Provably Fair Multisource Learning from Unreliable Training Data
Fairness-aware learning aims at constructing classifiers that not only make
accurate predictions, but do not discriminate against specific groups. It is a
fast-growing area of machine learning with far-reaching societal impact.
However, existing fair learning methods are vulnerable to accidental or
malicious artifacts in the training data, which can cause them to unknowingly
produce unfair classifiers. In this work we address the problem of fair
learning from unreliable training data in the robust multisource setting, where
the available training data comes from multiple sources, a fraction of which
might be not representative of the true data distribution. We introduce FLEA, a
filtering-based algorithm that allows the learning system to identify and
suppress those data sources that would have a negative impact on fairness or
accuracy if they were used for training. We show the effectiveness of our
approach by a diverse range of experiments on multiple datasets. Additionally
we prove formally that, given enough data, FLEA protects the learner against
unreliable data as long as the fraction of affected data sources is less than
half.
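The filtering step at the heart of such a multisource approach can be sketched as a majority-agreement test: a source is kept only if its statistics agree with at least half of all sources, which is safe under the stated assumption that fewer than half the sources are unreliable. The rule, the tolerance, and the per-source statistic below are illustrative assumptions, not FLEA's actual criteria:

```python
def filter_sources(stats, tolerance):
    """Keep source i only if at least half of all sources report a
    statistic within `tolerance` of source i's statistic. Under a
    clean majority, every clean source passes and any source whose
    statistic is far from all clean ones is suppressed."""
    kept = []
    for i, s in enumerate(stats):
        close = sum(1 for t in stats if abs(t - s) <= tolerance)
        if close * 2 >= len(stats):
            kept.append(i)
    return kept

# Per-source fairness statistic (e.g. a disparity measure); source 3
# is manipulated and would bias a model trained on the pooled data.
stats = [0.10, 0.12, 0.11, 0.60, 0.09]
kept = filter_sources(stats, tolerance=0.05)
```

In FLEA the comparison is done with respect to fairness- and accuracy-relevant quantities simultaneously, so that a source is suppressed if using it would harm either property.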
On the Sample Complexity of Adversarial Multi-Source PAC Learning
We study the problem of learning from multiple untrusted data sources, a
scenario of increasing practical relevance given the recent emergence of
crowdsourcing and collaborative learning paradigms. Specifically, we analyze
the situation in which a learning system obtains datasets from multiple
sources, some of which might be biased or even adversarially perturbed. It is
known that in the single-source case, an adversary with the power to corrupt a
fixed fraction of the training data can prevent PAC-learnability, that is, even
in the limit of infinitely much training data, no learning system can approach
the optimal test error. In this work we show that, surprisingly, the same is
not true in the multi-source setting, where the adversary can arbitrarily
corrupt a fixed fraction of the data sources. Our main results are a
generalization bound that provides finite-sample guarantees for this learning
setting, as well as corresponding lower bounds. Besides establishing
PAC-learnability our results also show that in a cooperative learning setting
sharing data with other parties has provable benefits, even if some
participants are malicious.
Comment: International Conference on Machine Learning (ICML) 2020: Camera-ready. Strengthened the definition of adversarial PAC-learnability, added explicit bounds on sample complexity
Data Leakage in Federated Averaging
Recent attacks have shown that user data can be recovered from FedSGD
updates, thus breaking privacy. However, these attacks are of limited practical
relevance as federated learning typically uses the FedAvg algorithm. Compared
to FedSGD, recovering data from FedAvg updates is much harder as: (i) the
updates are computed at unobserved intermediate network weights, (ii) a large
number of batches are used, and (iii) labels and network weights vary
simultaneously across client steps. In this work, we propose a new
optimization-based attack which successfully attacks FedAvg by addressing the
above challenges. First, we solve the optimization problem using automatic
differentiation that forces a simulation of the client's update that generates
the unobserved parameters for the recovered labels and inputs to match the
received client update. Second, we address the large number of batches by
relating images from different epochs with a permutation invariant prior.
Third, we recover the labels by estimating the parameters of existing FedSGD
attacks at every FedAvg step. On the popular FEMNIST dataset, we demonstrate
that on average we successfully recover >45% of the client's images from
realistic FedAvg updates computed on 10 local epochs of 10 batches each with 5
images, compared to only <10% using the baseline. Our findings show many
real-world federated learning implementations based on FedAvg are vulnerable.
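The key move in such an attack — simulate the client's local update procedure and search for inputs whose simulated update matches the one the server observed — can be illustrated on a one-parameter toy model. Everything below (the model, the grid search in place of gradient-based optimization, known labels) is a deliberate simplification for illustration:

```python
def fedavg_update(w0, batches, lr):
    """Simulate a client's local FedAvg steps for a scalar least-squares
    model y = w * x; return the weight the server would receive."""
    w = w0
    for x, y in batches:
        grad = 2 * x * (w * x - y)  # d/dw of (w*x - y)^2
        w -= lr * grad
    return w

def reconstruct_x(w0, observed_w, y, lr, candidates):
    """Toy inversion: pick the candidate input whose simulated client
    update best matches the observed FedAvg update (one local step,
    label assumed known, mirroring the 'recover labels first' idea)."""
    return min(candidates,
               key=lambda x: abs(fedavg_update(w0, [(x, y)], lr) - observed_w))

# Server observes the client's update; the attacker inverts it.
observed = fedavg_update(0.0, [(2.0, 1.0)], lr=0.1)
recovered = reconstruct_x(0.0, observed, y=1.0, lr=0.1,
                          candidates=[0.5, 1.0, 2.0, 3.0])
```

The real attack replaces the grid search with automatic differentiation through the simulated client steps, and handles many batches and epochs with the permutation-invariant prior described above.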
Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
Collaborative learning techniques have the potential to enable training
machine learning models that are superior to models trained on a single
entity's data. However, in many cases, potential participants in such
collaborative schemes are competitors on a downstream task, such as firms that
each aim to attract customers by providing the best recommendations. This can
incentivize dishonest updates that damage other participants' models,
potentially undermining the benefits of collaboration. In this work, we
formulate a game that models such interactions and study two learning tasks
within this framework: single-round mean estimation and multi-round SGD on
strongly-convex objectives. For a natural class of player actions, we show that
rational clients are incentivized to strongly manipulate their updates,
preventing learning. We then propose mechanisms that incentivize honest
communication and ensure learning quality comparable to full cooperation.
Lastly, we empirically demonstrate the effectiveness of our incentive scheme on
a standard non-convex federated learning benchmark. Our work shows that
explicitly modeling the incentives and actions of dishonest clients, rather
than assuming them malicious, can enable strong robustness guarantees for
collaborative learning.
Comment: Accepted to NeurIPS 2023; 37 pages, 5 figures
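One simple way to see how a mechanism can make honesty rational in single-round mean estimation is a peer-consistency payment: each client is rewarded for agreeing with the others, so a manipulated report costs its sender. This scoring rule is an illustrative assumption, not the mechanism proposed in the paper:

```python
def peer_consistency_payments(reports, penalty):
    """Pay each client a base reward minus a penalty proportional to the
    squared distance between its report and the mean of the other
    clients' reports. Honest reports (near the consensus) earn more."""
    n = len(reports)
    payments = []
    for i, r in enumerate(reports):
        others_mean = (sum(reports) - r) / (n - 1)
        payments.append(1.0 - penalty * (r - others_mean) ** 2)
    return payments

# Three honest clients and one manipulator inflating its report.
pays = peer_consistency_payments([1.0, 1.0, 1.0, 5.0], penalty=0.1)
# The manipulator's payment is strictly worse than an honest client's.
```

The general design question, as in the abstract, is to choose payments so that the honest strategy is an equilibrium while the aggregate still learns as well as under full cooperation.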
Differential cross section measurements for the production of a W boson in association with jets in proton–proton collisions at √s = 7 TeV
Measurements are reported of differential cross sections for the production of a W boson, which decays into a muon and a neutrino, in association with jets, as a function of several variables, including the transverse momenta (pT) and pseudorapidities of the four leading jets, the scalar sum of jet transverse momenta (HT), and the difference in azimuthal angle between the directions of each jet and the muon. The data sample of pp collisions at a centre-of-mass energy of 7 TeV was collected with the CMS detector at the LHC and corresponds to an integrated luminosity of 5.0 fb⁻¹. The measured cross sections are compared to predictions from Monte Carlo generators, MadGraph + pythia and sherpa, and to next-to-leading-order calculations from BlackHat + sherpa. The differential cross sections are found to be in agreement with the predictions, apart from the pT distributions of the leading jets at high pT values, the distributions of the HT at high-HT and low jet multiplicity, and the distribution of the difference in azimuthal angle between the leading jet and the muon at low values.
Funding: United States Dept. of Energy; National Science Foundation (U.S.); Alfred P. Sloan Foundation